Abstract

Multithreshold Entropy Linear Classifier (MELC) is a density-based
model which searches for a linear projection maximizing the Cauchy-
Schwarz Divergence of the dataset's kernel density estimates. Despite
its good empirical results, one of its drawbacks is the optimization
speed. In this paper we analyze how one can speed it up by solving an
approximate problem. We analyze two methods, both similar to the
approximate solutions of Kernel Density Estimation querying, and
provide adaptive schemes for selecting their crucial parameters based
on a user-specified acceptable error. Furthermore, we show how one can
exploit the well-known conjugate gradients and L-BFGS optimizers despite
the fact that the original optimization problem should be solved on
the sphere. All of the above methods and modifications are tested on 10
real-life datasets from the UCI repository to confirm their practical
usability.

1 Introduction 

Many methods of speeding up the kernel density estimator's (KDE)
querying process have been proposed in the literature. As the
optimization problem introduced in the Multithreshold Entropy Linear
Classifier [5] is closely related to the equations of KDE, it appears
natural that similar techniques can be used to simplify its computations
with a bounded error. The importance of such reductions comes from the
high (quadratic) complexity of evaluating the functions required during
training of this model, which makes it hard to use for any dataset with
more than a thousand points. In this paper we investigate two such
approaches. The first, sorting and discarding, ignores computations
of similarities between points that are too far apart to have a big
impact on the function's value. The second, binning, smooths the
function construction in order to heavily reduce the number of unique
points. Both methods are introduced in an adaptive manner so that the
optimization process has a fixed error bound despite many different
linear projections being analyzed during the training phase. We also
show a very simple method which makes it possible to use a wide range of
optimization algorithms even though the proposed model requires
optimization under a specific constraint (restriction to the unit
sphere).

2 Multithreshold Entropy Linear Classifier

Multithreshold Entropy Linear Classifier (MELC [5]) has been
recently proposed as an information theoretic approach for building
a model from the multithreshold linear family [1]. Its core idea is
to find a linear operator $v$ (with unit norm) such that the kernel
density estimates of the projected classes' training samples maximize
the Cauchy-Schwarz Divergence ($D_{CS}$). Let us recall the equation of
$D_{CS}$ in order to find the core computational bottleneck which appears
in MELC optimization:

$$D_{CS}(f_-, f_+) = 2 H_2^{\times}(f_-, f_+) - H_2(f_-) - H_2(f_+),$$

for $f_\pm = [\![v^T X_\pm]\!]$ being a kernel density estimator of
$v^T X_\pm$ with Silverman's rule; thus, from the definition of Renyi's
quadratic entropy, Renyi's quadratic cross entropy and the fact that
$\mathrm{ip}^{\times}(f, g) = \int f g$, we have

$$D_{CS}(f_-, f_+) = -2 \log \mathrm{ip}^{\times}(f_-, f_+) + \log \mathrm{ip}^{\times}(f_-, f_-) + \log \mathrm{ip}^{\times}(f_+, f_+).$$

As the whole $D_{CS}$ function is composed of $\mathrm{ip}^{\times}$
evaluations, in the rest of our paper we focus purely on
$\mathrm{ip}^{\times}$, which we expand using Gaussian kernel density
estimation and denote
$\mathrm{ip}^{\times}(v) = \mathrm{ip}^{\times}([\![v^T X_-]\!], [\![v^T X_+]\!])$:

$$\mathrm{ip}^{\times}(v) = \frac{1}{\sqrt{2\pi V(v)}\,|X||Y|} \sum_{x \in X,\, y \in Y} \exp\left(-\frac{\langle v, x - y\rangle^2}{2V(v)}\right),$$

where $V(v)$ is the sum of the per-class variances estimated using
Silverman's rule.

Clearly, naive computation of $\mathrm{ip}^{\times}$ is $O(N^2)$, where
$N = \max\{|X_-|, |X_+|\}$, due to the summation over all possible pairs
$(x, y)$. In the following sections we focus on methods which reduce this
computational bottleneck while still preserving a given approximation
quality of the $\mathrm{ip}^{\times}$ value.
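As an illustration only, the naive evaluation can be sketched in Python as below. This is a minimal sketch, not the implementation used in our experiments: the function name `ip_cross_naive` is hypothetical, the inputs are assumed to be points already projected onto $v$, and the variance term $V(v)$ is assumed to be computed elsewhere and passed in as `V`.

```python
import math


def ip_cross_naive(px, py, V):
    """Naive O(|X||Y|) evaluation of ip^x for already projected points.

    px, py: sequences of projections <v, x> and <v, y>.
    V: the variance term V(v), supplied by the caller (assumed > 0).
    """
    norm = math.sqrt(2.0 * math.pi * V) * len(px) * len(py)
    total = 0.0
    for a in px:          # every x ...
        for b in py:      # ... is paired with every y
            total += math.exp(-((a - b) ** 2) / (2.0 * V))
    return total / norm


# Toy usage with two points per class and V(v) = 1.
value = ip_cross_naive([0.0, 1.0], [0.0, 2.0], V=1.0)
```

The double loop makes the quadratic cost explicit: each call performs exactly `len(px) * len(py)` evaluations of `exp`.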


3 Reduction of $\mathrm{ip}^{\times}$ computational complexity

Sorting and discarding Let us begin with the very simple idea
of computing values of only those $(x, y)$ pairs which are close
enough to have an impact on the value of $\mathrm{ip}^{\times}$. If we
assume that the points' projections are sorted (which can be done in
general in $O(N \log N)$; in fact, for iterative optimization techniques
the ordering of points does not change much between subsequent calls, so
after the initial sorting it can be done in linear time using insertion
sort), we can search the dataset in linear time and identify for each
point $x$ the indices of the first and last points which are at most at
distance $T$ from $x$. The following theorem shows what $T$ to choose in
order to obtain at most $\varepsilon$ error.

Theorem 1. Using adaptive sorting and discarding with distance
threshold in each iteration of at least

$$\sqrt{\max\left\{0,\; -V(v) \ln\left(2 \left(\frac{\varepsilon}{p}\right)^2 \pi V(v)\right)\right\}},$$

where $V(v)$ is the sum of the per-class estimated variances, leads to
the computation of the $\mathrm{ip}^{\times}$ function with at most
$\varepsilon$ error, assuming that at most a fraction $p$ of the pairs of
points is located farther apart than $T$.

Proof. We assume that $|\langle v, x - y\rangle| > T$ for $N_T$ pairs of
points which are being ignored during computation of
$\mathrm{ip}^{\times}$, so $-\langle v, x - y\rangle^2 < -T^2$, thus

$$\frac{1}{\sqrt{2\pi V(v)}\,|X||Y|} \sum_{\text{ignored } (x,y)} \exp\left(-\frac{\langle v, x - y\rangle^2}{2V(v)}\right) < \frac{1}{\sqrt{2\pi V(v)}\,|X||Y|} \sum_{\text{ignored } (x,y)} \exp\left(-\frac{T^2}{2V(v)}\right) = \frac{N_T}{\sqrt{2\pi V(v)}\,|X||Y|} \exp\left(-\frac{T^2}{2V(v)}\right).$$


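If we look for an $\varepsilon$ approximation of the non-regularized MELC
objective we put $0 < p = N_T/(|X||Y|) < 1$ and consequently

$$\frac{p}{\sqrt{2\pi V(v)}} \exp\left(-\frac{T^2}{2V(v)}\right) < \varepsilon,$$

thus

$$T^2 > -2V(v) \ln\left(\frac{\varepsilon}{p}\sqrt{2\pi V(v)}\right).$$

Obviously, if $\ln\left(\frac{\varepsilon}{p}\sqrt{2\pi V(v)}\right) > 0$
then any $T$ satisfies this inequality (it can only happen if we choose
a very big acceptable error $\varepsilon$), so for simplicity we take the
maximum of this value with 0:

$$T > \sqrt{\max\left\{0,\; -2V(v) \ln\left(\frac{\varepsilon}{p}\sqrt{2\pi V(v)}\right)\right\}} = \sqrt{\max\left\{0,\; -V(v) \ln\left(2 \left(\frac{\varepsilon}{p}\right)^2 \pi V(v)\right)\right\}}.$$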

□ 
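The threshold of Theorem 1 and the resulting discarding scheme can be sketched as follows. This is a hedged illustration under stated assumptions, not our exact implementation: the function names are hypothetical, the inputs are assumed to be already projected points, $V(v)$ is passed in as `V`, and the window of neighbours is located with binary search (`bisect`) instead of the linear scan described above. Using `p=1.0` is assumed here as the most conservative choice, since it yields the largest threshold.

```python
import math
from bisect import bisect_left, bisect_right


def discard_threshold(V, eps, p=1.0):
    """Distance threshold T from Theorem 1 (0 if any threshold works)."""
    return math.sqrt(max(0.0, -V * math.log(2.0 * (eps / p) ** 2 * math.pi * V)))


def ip_cross_discard(px, py, V, eps, p=1.0):
    """Approximate ip^x, skipping pairs whose projections differ by more than T."""
    T = discard_threshold(V, eps, p)
    ys = sorted(py)  # sort once; windows are then contiguous slices
    total = 0.0
    for a in px:
        lo = bisect_left(ys, a - T)    # first y with y >= a - T
        hi = bisect_right(ys, a + T)   # one past the last y with y <= a + T
        for b in ys[lo:hi]:
            total += math.exp(-((a - b) ** 2) / (2.0 * V))
    return total / (math.sqrt(2.0 * math.pi * V) * len(px) * len(py))


# Toy usage: well separated groups, so most far pairs are skipped.
approx = ip_cross_discard([0.0, 0.1, 5.0], [0.05, 4.9, 9.0], V=0.5, eps=0.01)
```

Because only non-negative terms are dropped, the approximation never exceeds the exact value, and the gap stays below the requested `eps`.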

Binning While the sorting and discarding technique is quite easy to
implement and analyze, its practical speedup might be limited for
densely packed datasets. In such cases it might be more valuable to
perform a binning of our projected points, so that those located near
each other are approximated by their empirical mean. Such an approach
works well for densely packed datasets, which makes it complementary
to the previous one.

Let us assume that we have some partitioning $\mathbb{R} = \bigcup_i a_i$,
where each $a_i$ is an interval. We define a binning operator as
$b(x) = \mathrm{mean}\{x' \in v^T X \cap a_i\}$ where $x \in a_i$. For
simplicity we use the notation $\langle v, x\rangle_b = b(\langle v, x\rangle)$.
Similarly to the previous strategy, in order to preserve a good
approximation, the bin width ($B = \max_i |a_i|$) needs to be adapted in
each iteration, and the exact equation is given in the following theorem.

Theorem 2. Using the adaptive binning technique with bin width in each
iteration at most

$$\sqrt{-2V(v) \ln\left(\max\left\{0,\; 1 - \varepsilon\sqrt{2\pi V(v)}\right\}\right)},$$

where $V(v)$ is the sum of the per-class estimated variances, leads to
the computation of the $\mathrm{ip}^{\times}$ function with at most
$\varepsilon$ error.

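Proof. We assume that
$|\langle v, x - y\rangle - (\langle v, x\rangle_b - \langle v, y\rangle_b)| < B$, so

$$\left| \mathrm{ip}^{\times}(v) - \frac{1}{\sqrt{2\pi V(v)}\,|X||Y|} \sum_{x,y} \exp\left(-\frac{(\langle v, x\rangle_b - \langle v, y\rangle_b)^2}{2V(v)}\right) \right|$$

$$= \frac{1}{\sqrt{2\pi V(v)}\,|X||Y|} \left| \sum_{x,y} \left[ \exp\left(-\frac{\langle v, x - y\rangle^2}{2V(v)}\right) - \exp\left(-\frac{(\langle v, x\rangle_b - \langle v, y\rangle_b)^2}{2V(v)}\right) \right] \right|$$

$$\le \frac{1}{\sqrt{2\pi V(v)}\,|X||Y|} \sum_{x,y} \left| \exp(0) - \exp\left(-\frac{B^2}{2V(v)}\right) \right| = \frac{1}{\sqrt{2\pi V(v)}} \left| 1 - \exp\left(-\frac{B^2}{2V(v)}\right) \right|.$$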

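Let us now assume that we are given some acceptable error
$\varepsilon > 0$. We will show how small the bins have to be, based on
our dataset and the current projection. We require

$$\frac{1}{\sqrt{2\pi V(v)}} \left| 1 - \exp\left(-\frac{B^2}{2V(v)}\right) \right| < \varepsilon,$$

but $\exp\left(-\frac{B^2}{2V(v)}\right) < 1$, so

$$\frac{1}{\sqrt{2\pi V(v)}} \left( 1 - \exp\left(-\frac{B^2}{2V(v)}\right) \right) < \varepsilon,$$

thus

$$B^2 < -2V(v) \ln\left(1 - \varepsilon\sqrt{2\pi V(v)}\right).$$

Naturally, if $1 - \varepsilon\sqrt{2\pi V(v)} < 0$ then any $B$ satisfies
this inequality (similarly to the sorting and discarding method, it may
only happen if we choose a very large acceptable error $\varepsilon$), so
we introduce the maximum function here (with the convention
$\ln 0 = -\infty$, so that $B$ is unbounded in that case):

$$B < \sqrt{-2V(v) \ln\left(\max\left\{0,\; 1 - \varepsilon\sqrt{2\pi V(v)}\right\}\right)}.$$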

□ 
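A possible realization of adaptive binning is sketched below. It is a minimal illustration under several assumptions that are not part of the theorem: the function names are hypothetical, the partition is taken to be equal-width intervals anchored at zero (any partition into intervals of width at most $B$ would do), the inputs are already projected points, $V(v)$ is passed in as `V`, and `eps` is assumed to be strictly positive.

```python
import math
from collections import defaultdict


def bin_width(V, eps):
    """Largest bin width B allowed by Theorem 2 (math.inf if any width works)."""
    inner = max(0.0, 1.0 - eps * math.sqrt(2.0 * math.pi * V))
    if inner == 0.0:
        return math.inf  # ln(0) = -inf, so the bound is vacuous
    return math.sqrt(-2.0 * V * math.log(inner))


def bin_points(points, B):
    """Replace points by (bin mean, count) pairs using bins of width B."""
    groups = defaultdict(list)
    for a in points:
        groups[math.floor(a / B)].append(a)
    return [(sum(g) / len(g), len(g)) for g in groups.values()]


def ip_cross_binned(px, py, V, eps):
    """Approximate ip^x by summing over pairs of bins instead of pairs of points."""
    B = bin_width(V, eps)
    bx, by = bin_points(px, B), bin_points(py, B)
    total = 0.0
    for a, na in bx:
        for b, nb in by:
            # one exp call stands in for na * nb original pairs
            total += na * nb * math.exp(-((a - b) ** 2) / (2.0 * V))
    return total / (math.sqrt(2.0 * math.pi * V) * len(px) * len(py))


# Toy usage: the two nearly identical points of the first class share a bin.
approx = ip_cross_binned([0.0, 0.01, 1.0], [0.02, 1.01, 3.0], V=1.0, eps=0.05)
```

The number of `exp` calls drops from `len(px) * len(py)` to the product of the numbers of non-empty bins, which is where the reduction for densely packed data comes from.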

Figure 1 shows how these two bounds behave with increasing size of
the acceptable error. In particular, one can see that both bounds grow
at a very similar rate (up to the maximization/minimization symmetry)
with changing $\varepsilon$. As a result, because binning is a much
more aggressive technique, we should expect that using these bounds
as the actual bin width/discarding threshold will lead to a much greater
reduction of the computational complexity when using binning.




Figure 1: Plots of the values of the discarding threshold (on the left) and bin
width (on the right) as a function of the acceptable error $\varepsilon$.


4 Out of sphere optimization 

Now we are going to show that the MELC objective function can be
efficiently optimized in the whole space by adding a custom
regularization term. The importance of this result is that it
enables us to use a vast number of existing optimization techniques
(such as adaptive gradient descent, Conjugate Gradients, BFGS, L-BFGS,
etc.) without adapting them to the sphere constraint. The second
important aspect is that this modification does not involve
any additional constants which have to be fitted. The following
theorem describes the modified objective function.

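Theorem 3. Given arbitrary sets $X_-, X_+$ and the corresponding
function $D_{CS}(v) = D_{CS}([\![v^T X_-]\!], [\![v^T X_+]\!])$ we have:

$$d := \max_{\|v\| = 1} D_{CS}(v) = \max_{v} \left[ D_{CS}(v) - (\|v\|^2 - 1)^2 \right]$$

and

$$\{v : \|v\| = 1 \wedge D_{CS}(v) = d\} = \{v : D_{CS}(v) - (\|v\|^2 - 1)^2 = d\}.$$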

Proof. According to [5], $D_{CS}$ is scale invariant, so for any
projection vector $v$ and any $c \in \mathbb{R}_+$

$$D_{CS}(v) = D_{CS}(cv).$$

As a result also

$$D_{CS}(v) - (\|v\|^2 - 1)^2 = D_{CS}(cv) - (\|v\|^2 - 1)^2,$$

but as $-(\|v\|^2 - 1)^2 \le 0$ and
$-(\|v\|^2 - 1)^2 = 0 \iff \|v\| = 1$, we have that
$D_{CS}(v) - (\|v\|^2 - 1)^2$ is maximized for $v$ with norm 1 and that it
is then equal to $D_{CS}(v)$. As a result the sets of solutions of both
problems are identical.

□ 

Consequently, we can apply any advanced optimization technique
which is not designed to work on the sphere to optimize the $D_{CS}$
criterion. In particular, we can use L-BFGS [3] instead of the more
complex and less popular RBFGS and the previously proposed [5], less
efficient gradient descent on the sphere. At the same time the norm of
the candidate solution will stay close to 1, so we will not suffer from
numerical problems [5].

It is worth noting that despite its similarity to the $L_2$
regularization of an additive loss function (or weight decay in neural
networks), this additional term serves no regularization purpose, nor
does it affect the value of the function at the optimum. It only guides
gradient-based optimizers towards more informative regions of the state
space.

From the practical point of view we also need the gradient of the
new function, but thanks to the additivity of the derivative operator we
get

$$\nabla\left[ D_{CS}(v) - (\|v\|^2 - 1)^2 \right] = \left[ \nabla D_{CS}(v) \right] - 4v\left( \langle v, v\rangle - 1 \right),$$

and we can use any optimization software able to maximize a function
given $(f, \nabla f)$.
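The following sketch shows how such a penalized objective can be handed to scipy's L-BFGS-B. It is an illustration only: the functions `objective` and `neg_penalized` are hypothetical, and a Rayleigh quotient on a toy matrix `A` is used as a scale-invariant stand-in for $D_{CS}$, which is not implemented here. Since `scipy.optimize.minimize` minimizes, the sketch passes the negated value and gradient.

```python
import numpy as np
from scipy.optimize import minimize

A = np.diag([3.0, 2.0, 1.0])  # toy symmetric matrix for the stand-in objective


def objective(v):
    """Scale-invariant stand-in for D_CS (a Rayleigh quotient) and its gradient."""
    vv = v @ v
    f = (v @ A @ v) / vv
    grad = 2.0 * (A @ v - f * v) / vv
    return f, grad


def neg_penalized(v):
    """Negated f(v) - (||v||^2 - 1)^2 together with its negated gradient."""
    f, grad = objective(v)
    vv = v @ v
    value = f - (vv - 1.0) ** 2
    grad = grad - 4.0 * v * (vv - 1.0)  # gradient of the penalty term
    return -value, -grad


# jac=True tells scipy that the callable returns (value, gradient).
result = minimize(neg_penalized, x0=np.array([0.5, 0.5, 0.5]),
                  jac=True, method="L-BFGS-B")
v_opt = result.x
```

In this toy setting the optimizer is free to move anywhere in the space, yet the penalty pulls the solution back to unit norm while the stand-in objective attains its maximum, the largest eigenvalue of `A`.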


5 Evaluation 

We evaluate the proposed approximations on 10 datasets from the UCI
repository [2] and libSVM's repository. Both $D_{CS}$ and its
approximations are coded in Python using numpy and scipy [8]. We use
scipy's optimization module to train all models using two optimization
techniques: Conjugate Gradients (CG) and L-BFGS-B [3]. Each experiment
is performed in a cross-validation manner with multiple starting points
(randomly selected, but constant across methods to achieve comparable
results) because MELC optimization converges to local optima. We
analyze the $\gamma$ hyperparameter of $D_{CS}$ in [0.1, 0.5, 1.0, 1.5, 2.0]
and the acceptable error $\varepsilon \in$ [0.01, 0.02, 0.03, 0.05, 0.1, 0.2, 0.5].
Similarly to the original paper, we use Balanced Accuracy (BAC, see
footnote 2) as the measure of classification correctness due to MELC's
highly balanced formulation.
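For reference, Balanced Accuracy is the mean of the true positive rate and the true negative rate. A minimal sketch of this measure follows; the function name and argument order are our own choice for illustration, not taken from any library.

```python
def balanced_accuracy(tp, fn, tn, fp):
    """Mean of sensitivity TP/(TP+FN) and specificity TN/(TN+FP)."""
    return 0.5 * (tp / (tp + fn) + tn / (tn + fp))


# Toy usage: 8 of 10 positives and 5 of 10 negatives classified correctly.
score = balanced_accuracy(tp=8, fn=2, tn=5, fp=5)
```

Unlike plain accuracy, this score is not inflated when a classifier favours the larger class.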

First, we investigate how large the mean reduction of computations is
for each of the approximation schemes. Table 1 reports the mean ratio
of exp function calls (which is equivalent to the number of pairs
analyzed in each $\mathrm{ip}^{\times}$ evaluation when optimizing the
whole $D_{CS}$ function and its gradient) in a given method to that of
the original implementation.

One can easily notice that the sorting and discarding method (denoted
as "dist") roughly halves the number of analyzed pairs, while binning
(denoted as "bin") reduces it 3-10 times. It is an obvious consequence
of the fact that binning is a much more aggressive method. It appears
that the strength of the reduction depends only on the dataset, not on
the optimization algorithm used, which suggests that projections for
which a particular level of reduction is possible are uniformly
distributed over the space of all projections. These effects are also
heavily dependent on the choice of $\gamma$ and $\varepsilon$ (we do not
include the exact values in the table for better readability), which is
an obvious consequence of Theorems 1 and 2: with increasing variance
(which is proportional to $\gamma^2$) the reduction strength decreases
superlinearly.

                          CG            L-BFGS-B
name                 bin    dist      bin    dist
australian           0.11   0.44      0.11   0.45
breast-cancer        0.10   0.46      0.10   0.46
diabetes             0.21   0.56      0.22   0.54
fourclass            0.19   0.51      0.19   0.49
german.numer         0.15   0.47      0.19   0.46
heart                0.29   0.47      0.26   0.47
ionosphere           0.25   0.55      0.24   0.54
liver-disorders      0.29   0.65      0.31   0.67
sonar                0.32   0.53      0.29   0.50
splice               0.19   0.44      0.16   0.43

Table 1: Mean ratio of exp calls between the approximated technique and
the original method during optimization.

Footnote 2: $\mathrm{BAC} = \frac{1}{2}\left(\frac{TP}{TP+FN} + \frac{TN}{TN+FP}\right)$

The set of heat maps in Figure 2 shows the differences between the BAC
obtained by the original $D_{CS}$ and each approximation for a given
dataset and pair of hyperparameters $\gamma$, $\varepsilon$. In general,
up to a few isolated cases, errors are on the level of 0.5% to 3%. For
small $\gamma$ values the errors introduced by the approximation are
significantly higher and for the sonar and splice datasets can grow to
even 10%. Fortunately, these are very rare phenomena. Even more
interesting is the fact that for many experiments we actually noticed an
increase in the BAC score (bluish elements). This might be the
consequence of a rougher evaluation of the function (and gradient)
values leading to optimization less prone to falling into local maxima.
Our hypothesis is that it acts like a regularization helping to train
the MELC model.

Analysis of the number of iterations of each optimization method
required to converge (see Table 2) shows that both approximations
significantly simplify the problem. It is important to notice that the
number of iterations is not the number of $D_{CS}$ function evaluations
(as both Conjugate Gradients and L-BFGS-B evaluate it multiple times in
each iteration, especially during line searches). Consequently, the
number of iterations cannot be used as a measure of optimization speed,
but it says much about the complexity of the function being maximized.
This seems to confirm our claim that the approximation works similarly
to regularization and thus reduces small irregularities of the error
surface due to the removal of small elements from the internal summation
of $\mathrm{ip}^{\times}$.

Figure 2: Comparison of the cross validation BAC scores between a given
approximated strategy (two top rows sorting and discarding, two bottom
ones binning), $\gamma$ hyperparameter of $D_{CS}$ (x-axis), accepted
error $\varepsilon$ (y-axis). Positive values (and corresponding red
colors) represent a decrease in BAC score, while negative values (and
corresponding blue colors) represent an increase after using the
approximated method.

Experiments also showed the importance of the regularization technique
added to perform out of sphere optimization. During maximization of
$D_{CS}$ on the sonar and german datasets, norms of $v$ rapidly
grew to over 1000 if we turned off this modification and still used
CG/L-BFGS-B. As a result the optimization problem became extremely hard
and we needed tens of thousands of $D_{CS}$ evaluations in order to
converge.

                            CG                    L-BFGS-B
name                 bin    D_CS   dist      bin    D_CS   dist
australian             4      36     22       11      39     37
breast-cancer          4      35      8        6      39     14
diabetes               3      30     20       18      36     29
fourclass              4      12     10        6      15     14
german.numer           7      60     32        7      58     38
heart                  3      40     19       12      34     20
ionosphere             5     600    216       18     384    152
liver-disorders        4      30     22       22      43     30
sonar                  4     262    115       15     139    100
splice                 4      92     26       14      65     41

Table 2: Number of iterations required by each optimization method.


Adding the regularizing term reduced the norm to nearly 1 and the number
of required function calls by two orders of magnitude.


6 Conclusions 

In this paper we proposed two simple approximation schemes for faster
computation of the MELC objective function and its gradient. We proved
that in order to achieve a constant error bound during optimization one
needs a specific adaptive strategy for each of them and gave simple,
closed form equations for setting the required parameters based on the
user-specified acceptable level of error in the $\mathrm{ip}^{\times}$
function value. We also showed how one can easily change the objective
function in order to use a wide range of existing optimizers while at the
same time still working near the unit sphere which, as described in the
MELC theory [5], is important from the numerical point of view.

During extensive evaluation we confirmed that such an approach is
valid in terms of reducing the mean number of exp calls by even an
order of magnitude while not sacrificing the resulting classifier's
accuracy. In fact, the experiments suggest that the proposed method acts
like some kind of regularization which might not only simplify the
optimization problem but also slightly improve the obtained results.